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Abstract 

The Lagrangian vortex method offers an alternative numerical approach for direct numerical simulation of turbulence. The 
fact that it uses the fast multipole method (FMM) — a hierarchical algorithm for A/^-body problems with highly scalable parallel 
implementations — as numerical engine makes it a potentially good candidate for exascale systems. However, there have been few 
validation studies of Lagrangian vortex simulations and the insufficient comparisons against standard DNS codes has left ample 
room for skepticism. This paper presents a comparison between a Lagrangian vortex method and a pseudo- spectral method for 
the simulation of decaying homogeneous isotropic turbulence. This flow field is chosen despite the fact that it is not the most 
favorable flow problem for particle methods (which shine in wake flows or where vorticity is compact), due to the fact that it is 
^ ideal for the quantitative validation of DNS codes. We use a 256^ grid with 7?^^ = 50 and 100 and look at the turbulence statistics, 
including high-order moments. The focus is on the efl'ect of the various parameters in the vortex method, e.g., order of FMM series 
expansion, frequency of reinitialization, overlap ratio and time step. The vortex method uses an FMM code (exaFMM) that runs on 
fS| GPU hardware using CUDA, while the spectral code (hit 3d) runs on CPU only. Results indicate that, for this application (and 

with the current code implementations), the spectral method is an order of magnitude faster than the vortex method when using a 

^ single GPU for the FMM and six CPU cores for the FFT. 
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' 1. Introduction 

The simulation of homogeneous isotropic turbulence is one 
of the most demanding benchmarks for computational fluid dy- 
namics. The phenomenon of turbulence itself is a grand chal- 
lenge problem, of crucial importance in many applications. At- 
mospheric phenomena, combustion physics, aerodynamics of 
high-speed vehicles, and transport of pollutants are only a few 
examples. Given the difficulty of capturing a wide range of 
physical scales, turbulence simulations must rely on parallel 
computing, and even so are yet unable to reach the large values 
of Reynolds numbers that are encountered in many industrial 
settings without the use of turbulence models. Direct numeri- 
cal simulations that are used to generate data for investigating 
the nature itself of turbulence, are almost always conducted us- 
ing the pseudo- spectral method. In this method, the numerical 
engine is the fast Fourier transform and thus scalability of tur- 
bulence simulations in parallel computers is directly dictated by 
the FFT algorithm. 

The largest direct numerical simulation of isotropic turbu- 
lence to date has been performed on a cubic grid of size 4096^ . 
The first work to reach that record problem size calculated tur- 
bulence at R;i ^ 1200 [20, 13] on the Earth Simulator — a large 
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vector machine that can efficiently perform large-scale FFT. 
This record has not been broken even though the peak per- 
formance of supercomputers has increased nearly 50-fold since 
then. The record 4096^ grid size has been matched in the US 
recently [10], with simulations running on 16 thousand proces- 
sors of the Ranger system (Sun Constellation Linux cluster). 
But as far as we know, it has not been surpassed. 

As the high-performance-computing community is pushing 
to achieve exascale (10^^ operations per second), it is impor- 
tant to ask how and which algorithms will be able to scale in 
future systems. FFT is an algorithm that requires communica- 
tion among all processes involved in the computation, and this 
is the limiting factor for its scalability to large systems. In a re- 
cent feasibility study, Gahvari and Gropp [11] conclude that the 
bandwidth that would be needed for FFT at exascale rules out 
mesh or torus networks, and only fat- tree or hypercube inter- 
connects would be feasible. This poses significant constraints 
for future high-performance-computing systems. Therefore, it 
is becoming increasingly important to look at alternative algo- 
rithms that may achieve better performance on the extremely 
parallel machines of the future. 

Most standard methods of incompressible CFD require the 
solution of a Poisson-type equation, this being the most expen- 
sive part of the calculation, in terms of runtime. In addition 
to standard sparse linear solvers (such as multigrid) and FFT- 
based algorithms, there are some formulations which allow a 
solution via the fast multipole method, FMM. This use of the 
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FMM, best known as an A/^-body solver, has not gained much 
traction due to both its perceived complexity to code and use, 
and its much longer runtimes compared to FFT. The runtime 
advantage of FFT, however, may be trumped by parallel scala- 
bility at the extreme scales of future systems. Our current effort 
is part of an ongoing program of research into the role of FMM 
in the computational-science ecosystem at the exascale. 

We selected homogeneous isotropic turbulence in a periodic 
box as a test case, and apply a fast multipole vortex method 
for direct numerical simulation. There have been only a few 
works testing vortex methods with the benchmark of homoge- 
neous isotropic turbulence, partly due to their comparative inef- 
ficiency for solving this particular flow. Cottet et al. [9] used the 
vortex-in-cell method, which is a semi-Lagrangian method that 
relies on both particles and meshes and interpolations of field 
values between them. They compared with a spectral method 
for N - 128^ grid points at Rex - 98 and showed good agree- 
ment between the two methods in the evolution of the energy 
spectrum, kinetic energy, dissipation, enstrophy and skewness. 
Yokota et al. [24] compared a pure Lagrangian (mesh-free) vor- 
tex method against a pseudo- spectral method for N - 128^ at 
Rex - 25 and 50. They showed quantitative agreement between 
the vortex method and spectral method for the decay rate of ki- 
netic energy and energy spectrum, and also higher-order turbu- 
lence statistics. That work was aimed at investigating diff'er- 
ent schemes for diff'usion and also studied the eff'ect of periodic 
boundary conditions on the FMM. 

Spectral methods are based on FFT while vortex methods 
rely on the FMM to achieve high performance. On a small 
number of CPUs and for the same problem size, FFT can be 
orders of magnitude faster than FMM [9]. However, in mas- 
sively parallel systems, and especially with the help of GPU 
hardware, the diff'erence in runtime between these two numer- 
ical algorithms becomes less important. We aim to investigate 
the range of problem sizes and the scale of parallelism that 
will make these algorithms comparable in time-to-solution. Of 
course, in the application to fluid turbulence in particular, there 
are other advantages of spectral methods over vortex methods. 
Spectral methods ofl'er exponential convergence, and there are 
arguments against comparing high-order methods with lower- 
order methods at the same problem sizes (we are using a vortex 
method that is second-order accurate). Despite this very rea- 
sonable objection, we persist in using this benchmark because 
it is the accepted standard for DNS of turbulence, and provides 
a methodical approach to validation of the FMM-based vortex 
method. The goal of this paper is to provide evidence that the 
vortex method, accelerated with the FMM algorithm and GPU 
hardware, is indeed a proper tool for DNS. This needs to be 
established before we can make further progress exploring the 
scalability advantages of the FMM as a numerical engine in the 
post-petascale era. 

In this paper, we compare simulations using a pseudo- 
spectral DNS code and a vortex method DNS code. We used 
the hitSD pseudo-spectral DNS code developed at the Center 
for Turbulence Research of Stanford University [8]. This code 
is parallel on CPU clusters using MPI and relies on the f f tw3 
library. Our vortex method uses a highly parallel FMM library 



for GPUs that we have recently developed, called exaFMM. The 
performance of our FMM has appreciably increased from our 
previous studies, and we have added significant new function- 
ality, such as auto-tuning [21]. The comparisons are made un- 
der the same test conditions: same Taylor-scale Reynolds num- 
ber, Rex\ same discretization parameters. Ax and A^; and same 
initial/boundary conditions. We perform a systematic survey 
of the eff'ect of the various parameters that aff'ect the accu- 
racy and speed of vortex method simulations — order of FMM 
series expansion frequency of particle reinitialization, tol- 
erance of the RBF solver, A^, overlap ratio /z/cr, and Rex — , 
looking not only at the kinetic energy spectrum, but also the 
high-order turbulence statistics (skewness and flatness of ve- 
locity derivatives). Obtaining good agreement with the spectral 
method on the high-order statistics is a considerable challenge, 
since velocity derivatives require a higher spatial resolution to 
be calculated accurately [18]. We show that the FMM is a sat- 
isfactory numerical engine for the direct numerical simulation 
of fluid turbulence, despite the inherent approximations in the 
algorithm and even when using single precision on the GPU 
hardware. This supports the case that FMM as a numerical en- 
gine will gain increasing importance, given its favorable paral- 
lel scalability and computational intensity that make it suitable 
for many-core architectures. 

The present work focuses on validating an FMM-based vor- 
tex method code on GPU, using a spectral method code as refer- 
ence and looking at high-order turbulence statistics. In another 
publication [27], we focus on the performance on massively 
parallel systems of the vortex method, compared to the spectral 
method. That work included simulations of isotropic turbulence 
on up to 4096 GPUs, with a 4096^ problem size and exceeding 
1 Pflop/s of sustained performance. 

2. Numerical methods for turbulence 

2.7. Pseudo-spectral method 

The reference numerical method used in this work is a 
pseudo-spectral method with primitive-variable formulation. 
Pseudo- spectral methods have for decades been the preferred 
method for computing isotropic fluid turbulence, and their re- 
sults are trusted. Therefore, a quantitative comparison of the 
statistics of turbulence — kinetic energy decay, energy spectrum, 
high-order velocity statistics — will reveal the accuracy of our 
vortex method. We have used the open-source hit 3d code for 
homogeneous isotropic turbulence of an incompressible fluid 
in 3D. This code uses f f tw3 and is implemented in parallel for 
CPU clusters using MPI (see Acknowledgements for a link to 
the code project). 

The initial condition for all runs was generated with a sepa- 
rate code using the method described in section 4.2, and given 
as an input file to hit 3d. The original initial condition gen- 
erated by the hit3d code has a fully developed energy spec- 
trum, where the dissipation at the high wave numbers is in 
equilibrium with the energy transfer between the wave num- 
bers. The time evolution of the velocity derivative skewness 
and flatness in this case is very difl'erent from that in similar 



studies [9, 23, 24], and is not suitable for validating the vortex 
method's ability to predict the initial evolution of high-order 
turbulence statistics. For this reason, we have generated the ini- 
tial velocity field in the same way as Yokota et al. [24] and used 
that as an input to hit 3d. 



2. 2. Vortex methods for direct numerical simulation 



The term "vortex method" is used in the literature some- 
what loosely referring to one of several Lagrangian and 
semi-Lagrangian methods based on either the vorticity- 
streamfunction or vorticity-velocity formulation of the Navier- 
Stokes equation. All vortex methods obtain the convection term 
of the equation in a Lagrangian manner, and are not constrained 
by the traditional limitations of Eulerian methods in regards to 
the CFL condition, numerical difflision and dispersion. This 
feature alone allows vortex methods to use time- step sizes that 
are an order of magnitude larger than other methods treating 
convection explicitly [15]. The diff'erentiating feature of various 
vortex methods is the way in which they obtain the viscous term 
and the stretching term of the Navier-Stokes equation in vortic- 
ity formulation. One variant, called vortex-in-cell or particle- 
mesh vortex method, performs the calculation of viscosity and 
stretching terms on a grid, and continuously interpolates quanti- 
ties back and forth between particles and grid. The first work to 
undertake the simulation of homogeneous isotropic turbulence 
with a vortex method used an FFT-based Poisson solver to ob- 
tain the stream function on a grid [9] . The velocity and vortex 
stretching can then be obtained using finite-diff'erence formulas; 
Cottet et al [9], for example, use fourth-order centered diff'er- 
ences. The early vortex-in-cell method [7] used low-order in- 
terpolation schemes, which were too numerically diff'usive, but 
modern versions utilize high-order interpolation and have been 
assessed and compared well with finite-diff'erence and spectral 
methods [9, 19]. 

The other main variant of vortex methods maintains the grid- 
free character of the general approach by means of N-bo&y 
solvers to obtain the particle velocity and the vortex stretch- 
ing term on the particle locations. The first attempt to simu- 
late homogeneous isotropic turbulence with the grid-free vor- 
tex method used a single-CPU code on an A/^ = 128^ problem 
with Rex = 25, 50 [24]. That work examined the eff'ect of dif- 
ferent viscosity schemes, the number of periodic image boxes 
in the FMM, and the eff'ect of the series truncation parameter, 
p. The main challenge for these grid- free vortex methods is the 
relatively high calculation cost of the FMM, when compared to 
fast Poisson solvers using multigrid methods or FFT. However, 
recent studies on large GPU clusters suggest that the high paral- 
lel scalability of the FMM becomes an advantage over the FFT 
algorithm when thousands of MPI processes and GPU acceler- 
ation are used [27] . 

The present vortex method solves the velocity Poisson equa- 
tion, which is derived from mass conservation. This is solved 
along with the vorticity equation, which is derived from mo- 



mentum conservation. The governing equations are: 

V^u = -V X 6^, (1) 

— -\-u-Wci) = Ci}-Wu-\- vW^cj, (2) 
dt 

where u is the velocity vector, o) is the vorticity vector, and v 
is the kinematic viscosity. The velocity Poisson equation (1) 
can be formulated as an integral equation using Green's meth- 
ods, which becomes an A/^-body problem that can be efficiently 
solved using the FMM. The vorticity equation (2) is solved by 
updating the convection, stretching, and diff'usion terms sepa- 
rately. This does not imply a fractional- step method since all 
three terms can be expressed as ordinary diff'erential equations 
updating separate variables — the particle positions, x, for con- 
vection; the particle strengths, 7, for vortex stretching; and the 
particle width, cr, for diff'usion — and the updates happen simul- 
taneously. 

The Lagrangian discretization is based on moving Gaussian 
basis functions, where the total vorticity field is expressed by 
their superposition as 
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(3) 



with yj the vortex strength of each basis function (interpreted 
as a particle of vorticity). The Gaussian basis function with 
standard deviation cr is defined by 
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where r^y is the distance between point / and point j. The inte- 
gral form of Equation (1), also known as the Biot-Savart equa- 
tion, can be written as 



(5) 



where G = 1 /4;rr/y is the Green's function of the Laplace kernel 
and 



gcr = erf 



r2 
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-^exp -T^ (6) 



is the cutoff' function, derived by integrating the Gaussian dis- 
tribution in (4) while considering radial symmetry. 

The vorticity equation (2) expresses the simultaneous con- 
vection, stretching and diff'usion of vorticity. In the Lagrangian 
vortex method formulation, the spatial discretization using 
Gaussian basis functions results in a system of ordinary diff'er- 
ential equations, which can be integrated explicitly. The con- 
vection term is accounted for by integrating the position coor- 
dinates of vortex particles according to the local velocity. The 
Eulerian convection term, u • Va>, is equivalent to a Lagrangian 
convection of the Gaussian basis functions. 



dt 



(7) 



Unlike Eulerian methods, which aker the value of vorticity on 
the mesh to account for the effect of convection, Lagrangian 
methods simply move the point without changing the value as- 
sociated to it. Lagrangian convection of basis functions com- 
bined with reinitialization or remeshing schemes results in a 
convergent numerical method, as has been shown with semi- 
Lagrangian methods. For the vortex method, Leonard [14] 
proved that Lagrangian convection of Gaussian bases in the 
vortex method result in a second-order truncation error for the 
convection term (this truncation error is known as "convection 
error" in the vortex method literature). 

The stretching term, Vu, is solved by substituting the Biot- 
Savart equation (5) into u, and the Gaussian basis function (3) 
into a>, yielding 



^ = |^v(r,xVGg.).r,. 



This update occurs in a Lagrangian frame since the Lagrangian 
convection is happening simultaneously. 

Finally, the diffusion term, yV^6>, can be calculated by 
changing the variance of the Gaussian distribution according 
to 



dcr^ 



= 2y. 



(9) 



Since the Gaussian distribution is an exact solution of the diffu- 
sion equation, spreading the distribution/variance of the Gaus- 
sian basis function is equivalent to solving the diffusion equa- 
tion exactly, as long as cr remains below a certain threshold. In 
fact, the core spreading method is an exact solution to the lin- 
earized Navier- Stokes equation (i.e., the convection-diffusion 
equations) in a Lagrangian frame with a constant flow field. The 
method is convergent as long as cr is small. Rossi [17] analyzed 
rigorously the residual of the core spreading method and proved 
convergence by showing boundedness of the solution and that 
the residual tends to zero with the numerical parameters. 

In order to keep cr from growing indefinitely, we perform a 
radial basis function (RBF) interpolation to smaller Gaussian 
distributions. This technique is known to achieve higher accu- 
racy than particle strength exchange methods with remeshing 
[1, 2, 24]. RBF interpolation is obtained by solving a linear 
system for Equation (3), with y as the unknown vector and a> 
as the right-hand side. We used a GMRES method, where the 
matrix-vector multiplications are done in matrix-free form by 
calculating Equation (3). Since the Gaussian function decays 
rapidly, we use the EMM neighbor list to calculate Equation 
(3) between neighboring particles only. For an overlap ratio of 
h/cr = 1, where h is the average distance between particles, 
the RBF system is well conditioned and converges in 5-10 it- 
erations [3, 22]. The six EMM kernels and the matrix- vector 
multiplication in RBF interpolation are done on the GPU. In the 
current implementation, the tree construction, update of parti- 
cles, GMRES outer iteration, and vortex method time integra- 
tion (forward Euler) are all performed on the CPU. 



3. Fast multipole method, FMM 

3.1. FMM for vortex methods 

The Biot-Savart equation (5) and the stretching term (8) are 
A/^-body problems where the effect of all particles must be calcu- 
lated against each other. Thus, they appear to require O(N^) op- 
erations for N particles, but with FMM the complexity is 0(N). 
The FMM is based on approximation of the Green's function 
by multipole and local expansions. For example, the Green's 
function for the Laplace kernel can be approximated by the fol- 
lowing multipole expansion 



j-l n-O m--n 



(8) and also by the local expansion 
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(11) 

where p is the order of expansion. The definition of variables 
follows the nomenclature of Cheng et al. [6]. In these equa- 
tions, the location of particle / with respect to the center of ex- 
pansion is expressed in spherical coordinates using (r/,^/,0/), 
and the location of particle j is (pj^aj^jSj)', Y^(0,(p) are the 
spherical harmonic functions. We define the operators Si, Mj, 
Ri, Lj as shown in Equations (10) and (11). Using these opera- 
tors. Equation (5) can be approximated by 
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(13) 



n-0 m--n I j-l 



Note that these equations are used to calculate the far field, 
where the cutoff function go- ^ I, and can be omitted from the 
calculation. Similarly, Equation (8) can be approximated by 
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The near field, on the other hand, is calculated by solving Equa- 
tion (5) exactly. For more details, in particular explaining the 
method to obtain periodic boundary conditions, see previous 
publications [23, 24]. The basic idea for periodic FMM is to 
place periodic images around the original domain and use mul- 
tipole expansions to calculate their influence. These periodic 
images can be accounted for with very few multipole expan- 
sions and little computational effort, and the calculation time 
required does not depend on the number of particles. 



3.2. Hybrid treecode-FMM with auto-tuning 

The FMM algorithm that we used in previous work [23], sim- 
ilarly to all other authors, required precise tuning of the pa- 
rameters every time we changed the desired accuracy or hard- 
ware. For example, for a given problem and hardware, the 
number of levels in the tree that results in optimal runtime 
varies. One of the most useful modifications of the algorithm, 
which could aid its wider adoption by computational scientists, 
is eliminating the need for parameter selection and tuning. Ar- 
guably, parameter auto-tuning is the most important feature that 
makes scientific software libraries successful. We have also ob- 
served in a separate work [26] that the 0{N log N) treecode ex- 
hibits more acceleration compared to the 0{N) FMM when us- 
ing GPU hardware. This suggested that treecodes could gain 
over the complexity advantage of FMM by means of hardware 
acceleration — contrary to the common wisdom that "complex- 
ity trumps hardware", as eloquently stated by Board and Schul- 
ten [5]. On the other hand, Cheng et al. state that "a properly 
implemented FMM [...] always selects the least expensive op- 
tion" [6]. Inspired by all these considerations, we recently de- 
veloped a novel treecode-FMM hybrid algorithm with the ca- 
pability of auto-tuning the parameters [21]. 

The hybrid treecode-FMM uses a generic and flexible 0{N) 
algorithm for traversing the tree, which can handle cases where 
the target particles and source particles are difl'erent. This fea- 
ture is useful for calculating the velocity on a lattice that is in- 
duced by scattered, Lagrangian particles (e.g., during RBF in- 
terpolation). The traversal is based on a stack data structure, and 
allows the interactions in the algorithm to be of cell-cell or cell- 
particle type, while at the same time automatically choosing 
the number of particles per box at the deepest levels of the tree, 
without user input. See details in our recent publication [21]. 

4. Homogeneous isotropic turbulence simulations 

4.1. Calculation conditions 

The flow field of interest consists of decaying homogeneous 
isotropic turbulence in a periodic box. The calculation domain 
is a cube of size [-n, tt]^, and the number of calculation points 
was = 256^ = 16, 777, 216 for both the vortex method and 
spectral method. For the vortex method calculation, we stud- 
ied the eff'ect of various parameters, as described in the next 
section. The base parameters are set to the following values, 
unless otherwise noted. 

• Order of FMM expansion : = 10 

• Number of periodic images :3^x3^x3^-l 

• Drop tolerance of the Krylov solver : 10"^ 

• Frequency of reinitialization : every 10 steps 

• Temporal resolution : A^ = 0.005 (using Euler method) 

• Taylor- scale Reynolds number : Re^ = 50 

• Spatial resolution : N = 256^ 



The same temporal and spatial resolution was used for the spec- 
tral method, and A^ was set to the maximum value that the spec- 
tral method can calculate stably. We will show later that the La- 
grangian vortex method remains stable and accurate for larger 
values of At. 

The vortex method calculations were run on a single GPU 
using single precision (64 threads per thread-block), while the 
spectral method calculations were performed on all six cores 
of a multi-core CPU using local MPI processes (and no GPU 
acceleration). Note that for a value of p = 10 in the FMM 
we obtain 4 significant digits of accuracy in the velocity and 
stretching calculations, which is lower than single-precision ac- 
curacy. Accordingly, we observed no appreciable change in the 
results when running in double precision. The spectral method 
code hit 3d does not have the feature to allow testing in single 
precision, and so we can only run it in double. This should be 
kept in mind when we report runtimes, below. 

The system used consists of Intel Nehalem-generation CPUs 
(Intel® Xeon® E5650 2.66GHz), and we used NVIDIA C2070 
Fermi GPUs. With these hardware specifications, the vortex 
method takes approximately 10 seconds per time step on one 
GPU chip, while the spectral method takes about 1 second per 
time step on a single CPU socket (6 cores). However, as the 
number of processes is increased the diff'erence between the 
runtime of the vortex method and spectral method decreases 
[27]. We used an expansion order = 6 for the timings, since 
it has been confirmed that this yields sufficient accuracy to cal- 
culate high-order turbulence statistics, as we will show later. 
Using p = 6 instead of the default value p = 10 reduces the 
calculation time from 20 seconds to 10 seconds. 

4.2. Initial condition 

For the reasons mentioned in Section 2.1, we do not use the 
initial condition provided by hit 3d, but generate our own ini- 
tial condition and use it as an input to hit 3d (and to the vortex 
method code). 

The initial velocity field was given a prescribed energy spec- 
trum of 

E^k'exp\-^^. (16) 

The velocity field was generated in Fourier space as a solenoidal 
isotropic field with random phases, and transformed to physical 
space [16], so that the resulting velocity field would have a zero 
mean and a Gaussian distribution in the fluctuation. The peak 
wave number of the prescribed energy spectrum is = 4. We 
use this velocity field in Fourier space as the initial condition 
for the pseudo-spectral method. 

The vortex method requires the information of coordinates 
X, vortex strength y, and core radius cr. These values are cal- 
culated in the following manner. First, the coordinates for the 
vortex elements are obtained from their chosen locations in- 
between the grid points of the spectral method. For example, 
if the grid points of the spectral method were at the corners of 
a box, the vortex elements are placed at the center of this box. 
The vorticity at these points is calculated from the initial veloc- 
ity field using a fourth-order central difference scheme. Then, 



the core radius of the elements is set to cr = h, where h is the 
distance between the vortex elements/grid points, resulting in 
an overlap ratio of h/cr = 1. Finally, the vortex strength is 
calculated by RBF interpolation with the same Gaussian basis 
function that is used for the vortex elements with cr = h. It is 
common in vortex methods to use a smaller overlap ratio, say 
h/cr ^ 0.8, but the homogeneity of the flow in isotropic tur- 
bulence allows us to use this relatively large overlap ratio (see 
evidence presented in section 5.5). This is beneficial because 
it reduces the ill-conditioning of the RBF matrix, and thus in- 
creases the speed of our RBF interpolation process. 

5. Results and discussion 

Spectral methods have very few parameters that aff'ect their 
accuracy and efficiency. Once the spatial and temporal resolu- 
tions are set correctly, there is little room for a spectral method 
simulation of isotropic turbulence to go wrong. On the other 
hand, Lagrangian vortex methods have a number of tunable pa- 
rameters such as: expansion order in FMM, frequency of par- 
ticle reinitialization, tolerance of the RBF interpolation. The 
eflfect of such parameters on direct numerical simulation of tur- 
bulence (especially when looking at high-order statistics) has 
not been investigated in much detail. In this section, we present 
the results of a wide range of parameter studies for the isotropic 
turbulence benchmark. We use a spectral method code as the 
reference and compare it with a vortex method code by varying 
the tunable parameters. 

5.7. Effect of expansion order in the FMM 

Computing with the FMM using p expansion terms for N 
particles requires O(p^N) work and O(p^N) storage [6], while 
the error decreases exponentially with p. It is obvious that the 
accuracy in the calculation of the velocity and stretching terms 
of the vorticity equation depends on p. However, it is not ap- 
parent how large p actually needs to be in order to calculate a 
turbulence simulation with sufficient accuracy. Therefore, as a 
first test case for our sequence of parameter studies, we vary p 
between 6, 8 and 10. 

The energy spectra for the initial velocity field are shown 
in Figure 1(a). In the legend, "spectral" represents the spectral 
method and "vortex (p=*)" is the vortex method with expansion 
order p = ^. As mentioned in Section 4.2, the initial condi- 
tion for vortex methods is calculated by determining the proper 
vortex strength of each element, such that the sum of the 
velocities induced by all particles matches the initial velocity 
field. Each vortex particle has a Gaussian distribution of vortic- 
ity, and a system of equations is solved to determine the vortex 
strength of all particles so as to reproduce the vorticity field ac- 
curately. Then, the energy spectrum is obtained by calculating 
the velocity field on a lattice using the Biot-Savart law (5) and 
performing a 3-D FFT on this velocity field. Therefore, the dis- 
crepancy in Figure 1(a) between spectral methods and vortex 
methods can be caused by either the inaccuracy of the repre- 
sentation of the velocity field by a superposition of Gaussian 
basis functions, or the inaccuracy of the Biot-Savart calculation 
using FMM. 
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(b) Energy spectra at t/T=20. 

Figure 1 : Initial and time-evolved energy spectra for simulations of decaying 
turbulence at Re;i = 50, with varying expansion order p in the FMM. 



As seen in Figure 1(a), the order of expansion p has a sig- 
nificant eff'ect on the tail of the energy spectrum. This eff'ect is 
somewhat exaggerated in the artificial velocity field at the ini- 
tial step, which has a very sharp drop in energy at high wave 
numbers. At later time steps, as shown in Figure 1(b), the high- 
frequency noise is less prominent, although we see a marked 
diff'erence with varying value of p. We will show in the follow- 
ing subsection that this diff'erence in the high-frequency range 
has little eff'ect on high-order turbulence statistics. 

This issue of /^-dependence was not discussed in previous 
calculations of isotropic turbulence using Lagrangian vortex 
methods [12, 23, 24]. The present simulations also use higher 
spatial resolution relative to the Reynolds number than previ- 
ous works. Therefore, the damping of high-frequency modes is 
not as noticeable, and the energy spectrum matches that of the 
spectral method up to higher wave numbers. This is consistent 
with our previous claims that Lagrangian vortex methods can 
match the results of spectral methods if the spatial resolution is 
sufficient. 

5.2. High-order turbulence statistics 

Using the notation Ux = du/dx, the skewness and flatness of 
the velocity derivative moments are defined by 



u^Jul 



(17) 



where is the skewness, and F4 is the flatness of the velocity 
derivative moment. The time evolution of the skewness and 
flatness of the velocity derivative moments are shown in Figure 
2 for difl'erent values of p. Time is normalized by the large- 
eddy-tumover time T, which is defined as 

T = Liu' (18) 

where the integral length scale L and fluctuating velocity u' are, 
respectively, 

^ " 2^ / ^''^^^^^^ ^^^^ 
1 ^ 

u' = — y u^ + v^ + w? (20) 

i-i 

The large-eddy-turnover time increases as the decaying 
isotropic turbulence simulation proceeds. Therefore, we use 
the initial value of T = 2 to normalize the time. 

Both the skewness and flatness show little variation among 
the vortex method calculations with difl'erent values of In the 
case of the skewness in Figure 2(a), the vortex method matches 
quantitatively with the spectral method throughout the entire 
duration of the simulation to t/T = 20. Since the skewness of 
the velocity derivative is closely related to the cascade of kinetic 
energy from low to high wave numbers. Fig. 2(a) is proof that 
our stretching-term calculation is accurate enough to simulate 
the energy cascade even with a relatively low expansion order 
of p = 6. 

The flatness of the velocity derivative agrees among the three 
vortex method calculations, but there is a discrepancy between 
the spectral method and vortex method in the range t/T = 2- 
10. The fact that the flatness matches better at later times can be 
explained by the decrease in Reynolds number and the increase 
in relative spatial resolution of the turbulence. This supports 
our argument that vortex methods simply need more points to 
represent the same physics, when compared to spectral methods 
(which is not surprising, given that spectral methods ofl'er expo- 
nential convergence). This is also supported by the fact that the 
present vortex method calculation with N = 256^ at Re;i = 50 
has much better agreement in the flatness when compared to 
the previous calculation with N = 64^ at Re;i = 25 [24], due to 
the increase in relative spatial resolution. Further investigations 
should consider the possibility of large eddy simulation (LES) 
with Gaussian filtering, which could be a good match with the 
vortex elements that use Gaussian basis functions. 

5.3. Frequency of reinitialization 

Lagrangian vortex methods converge to the Navier- Stokes 
equation only if there is sufficient overlap between the vor- 
tex particles [4]. In order to maintain overlap of all particles 
in long simulations, particle coordinates must be reinitialized 
onto a regular lattice every few steps. In the present simula- 
tions, we use a core spreading method with RBF interpolation 
for the reinitialization. In previous publications using this ap- 
proach [12, 23, 24], the frequency of remeshing and tolerance 



of the linear solver for RBF interpolation were chosen by trial 
and error. Readers could not know if these parameters were op- 
timal, or how much they aff'ected the accuracy and efficiency of 
the simulation. In the present work, we performed a paramet- 
ric study regarding the frequency of reinitialization, tolerance 
of the linear solver and temporal resolution At, to shed some 
light on this matter and aid reproducibility of accurate turbu- 
lence simulations with the FMM-based vortex method. 

The default parameters in our tests are listed in the previ- 
ous section. We first change the frequency of reinitialization 
to determine its eff'ect on the high-order turbulence statistics. 
The process of RBF interpolation is performed using a very ef- 
ficient approach that re-uses the FMM interaction list to com- 
pute the matrix- vector multiplications. Nevertheless, it requires 
significantly longer runtime than the highly scalable and GPU- 
accelerated Biot-Savart and stretching term calculations using 
the exaFMM code [27]. Therefore, the total runtime can be 
appreciably reduced by minimizing the frequency of reinitial- 
ization, but reducing it too much may hurt accuracy. Figure 
3 shows the eff'ect of reinitialization frequency on the time 
evolution of the skewness and flatness. In the legend, "vor- 
tex (reinit=*)" represents the vortex method calculation that 
reinitializes every ^ steps. The cases with "reinit=10" and 
"reinit=20" give similar results, but "reinit=50" clearly diverts 
from the other two. In the current simulations, reinitialization 
takes approximately ten times longer than the sum of the Biot- 
Savart, stretching, and difl'usion terms for one time step. There- 
fore, reinitializing every 10 steps will reduce the calculation 
runtime by roughly 5 times, compared to doing it on every step. 
It is rather surprising that reinitializing only every 50 steps still 
reproduces high-order turbulence statistics to the degree that it 
does. It is important to point out, however, that isotropic tur- 
bulence is a very special flow field where the distribution of 
vortex particles remains quite regular due to the homogeneous 
and isotropic nature of the convection at small scales. The fre- 
quency of reinitialization and large overlap ratio h/cr = I used 
in the isotropic turbulence simulations would not be applicable 
to any other flow field. 

The reinitialization process can be further accelerated if we 
can aff'ord to use a lower exit tolerance in the iterative solver 
for the RBF interpolation. Results with diff'erent solver toler- 
ances are shown in Figure 4, and we note that skewness and 
flatness are almost indistinguishable for the three cases. We 
use the approximation yj ^ a>i(Ax)^ as an initial guess for the 
solver, where Ax is the distance between the grid points. The 
tolerance is measured by the relative drop of the residual from 
this initial guess. Therefore, Figure 4 indicates that this initial 
guess is not too far from the actual solution. However, just be- 
cause the RBF interpolation requires very little iteration, this 
does not mean that the reinitialization is not necessary. This is 
evident from Figure 3, where the vortex method with infrequent 
reinitialization shows some discrepancy. 

5.4. Temporal resolution 

The advantage of Lagrangian methods is often viewed as the 
ability to use larger time-step sizes. In order to justify this claim 
we have increased and compared the high-order turbulence 




Figure 2: Time evolution of the skewness and flatness for the spectral method and vortex method with difl'erent expansions order of the FMM, p. 
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Figure 3: Time evolution of the skewness and flatness for the spectral method and vortex method with difl'erent frequency of reinitialization. 
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Figure 4: Time evolution of the skewness and flatness for the spectral method and vortex method with difl'erent tolerance for the linear solver. 
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Figure 5: Time evolution of the skewness and flatness for the spectral method and vortex method with difl'erent At. 



Statistics; the skewness and flatness are shown for diff'erent At in 
Figure 5. The frequency of reinitiaHzation is kept constant with 
respect to time and not the number of steps. For example, when 
At is doubled the reinitialization is performed every 5 steps in- 
stead of 10. It can be seen from Figure 5 that increasing At 
to 0.01 does not change the results of the turbulence statistics. 
However, increasing to 0.025 does result in an appreciable 
discrepancy in both the skewness and flatness. It is worth noting 
that A^ = 0.005 is the maximum step size that spectral meth- 
ods can calculate stably, and the Lagrangian vortex method is 
only able to double the step size without significant drawbacks. 
The present results with the vortex method were obtained using 
Ist-order Euler time integration. Using a higher-order time in- 
tegration scheme would likely allow larger time steps without 
harming the turbulence statistics. 

5.5. Overlap ratio of vortex particles 

The overlap ratio /z/cr is an important parameter for vortex 
method simulations. Firstly, maintaining h/cr < 1 is a neces- 
sary condition for the vortex particle method to converge to the 
Navier-Stokes equation. Secondly, this parameter aff'ects the 
ill-conditioning of the RBF interpolation matrix, and therefore 
it aff'ects the calculation time. For simulations of external flows 
with a large variation of particle density, a small overlap ratio 
would be necessary. Note that the overlap ratio is traditionally 
defined in a counterintuitive way, where small overlap ratio ac- 
tually means that the particles are overlapping more. This is 
due to the fact that the convergence proof of the vortex method 
relies on the limit h/o- ^ 0. 

The eff'ective spatial resolution of vortex methods depends 
not on the particle spacing h, but on the core radius cr. Even 
as we increase the number of particles, if cr is not simultane- 
ously decreased, the spatial resolution will not increase. This is 
clearly observed in Figure 6, where the overlap ratio is changed 
from 0.6 to 1.4 for a constant h = 27r/256; h/a = 0.6 means the 
core radius a is the largest of the three, and the eff'ective spa- 
tial resolution is the worst among the three. The flatness of the 
velocity derivative shows a larger discrepancy with the spectral 
method in this case. The skewness is afl'ected less by the spatial 
resolution, as will be discussed in further detail below. 



5.6. Effect of Reynolds number 

In order to check the efl'ect of Reynolds number, we have per- 
formed the same calculations with the spectral method and vor- 
tex method on a 256^ lattice, but for Re;i = 100. The results are 
shown in Figure 7, where it can be seen that the high-order tur- 
bulence statistics do not match as well as in the Re;i = 50 case. 
For example, the skewness of the velocity derivative matches 
very well in Figures 2, 3, 4, and 5, but noticeable difl'erences 
are seen in Figure 7. Also, the flatness of the velocity derivate 
deviates visibly for the Re^^ = 100 case, illustrating the chal- 
lenge of obtaining high-order statistics. As discussed by Schu- 
macher et al. [18], each moment has its own dissipative scale, 
and higher resolution (or lower Reynolds number) is needed to 
resolve higher-order derivatives. The present results also illus- 
trate this point, but with the vortex method: reducing Re^^ from 
100 to 50 with the 256^ lattice makes the skewness match very 
well with the spectral method and lessens the deviation in the 
flatness. 

5. 7. Isosurface of the second invariant 

Figure 8 shows the isosurface of the second invariant of the 
velocity derivative tensor at time t/T = 20 for Re;i = 50 and 
Re;i = 100. Not only do the statistical properties of the tur- 
bulence agree between the vortex method and spectral method, 
but also the instantaneous velocity field itself is almost identi- 
cal. This also confirms that our periodic EMM, which calculates 
the eff'ect of 27^ periodic images with multipole expansions, is 
producing an accurate velocity field even at the edges of the 
domain. 

5.8. Performance 

The wall-clock time of a single time step was 10 seconds on 
a single GPU for the EMM-based vortex method (p = 6), and 1 
second on one CPU socket (6 cores) for the EFT-based pseudo- 
spectral method. Note that calculating the EFT on GPUs would 
not provide a significant performance improvement, since the 
performance increase of cufft over f ftw is small when the 
data transfer between the host and device is taken into account^ . 



^ See http : //www . sharcnet . ca/?merz/CUDA_benchFFT/ 
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Figure 6: Time evolution of the skewness and flatness of du/dx for the spectral method and vortex method with diff^erent h/cr. 




Figure 7: Time evolution of various turbulence statistics for Re^ = 100. 




256 1 



y 256 1 



X 



(a) Spectral method, Re;i = 50 



(b) Vortex method, Re;i = 50 



256 > 




256 



256 




256 



(c) Spectral method, Re^ = 100 (d) Vortex method. Re a = 100 

Figure 8: Isosurface of the second invariant of the velocity derivative tensor, //. 



In a separate publication [27], we report much larger calcu- 
lations with less emphasis on the physics and more emphasis 
on the efficiency and scalabiHty of the codes. We performed 
a simulation of isotropic turbulence at 7?^^ ^ 500 on a 4096^ 
lattice using vortex methods and spectral methods. The FMM- 
based vortex method used 4096 NVIDIA M2050 GPUs on the 
TSUBAME 2.0 system and achieved 74% parallel efficiency, 
while the spectral method reached only 14% parallel efficiency 
on 4096 CPU cores [25]. The wall-clock times of those simula- 
tions were 100 seconds per time step for both the vortex method 
and spectral method, showing that at this level of parallelism 
(and with the help of GPU hardware), the FMM- and FFT-based 
methods may start to compete. 



6. Conclusions 

We have presented results of homogeneous isotropic turbu- 
lence with N = 256^ (almost 17 million particles), at Re;i = 50 
and 100, using a vortex particle method and compared the re- 
sults with a pseudo- spectral method with the same 256^ mesh. 
In particular, we have performed an array of parametric studies 
for the various parameters that affect the accuracy/efficiency of 
our vortex particle method. 

We found that using a lower-order expansion in the FMM 
produces some noise at the higher frequency of the kinetic en- 
ergy spectrum, but has little effect on the overall turbulence 
statistics. For example, with p = 6 the noise at the tail of the 
spectrum is quite large, but the skewness and flatness of the ve- 
locity derivative show little deviation between = 6, 8, and 
10. 



The frequency of reinitialization plays and important role in 
assuring the overlap between vortex elements. For the case of 
homogenous isotropic turbulence, we observed that the homo- 
geneity and isotropy of the flow permits infrequent remeshing 
to a larger extent than other turbulent flows. For the present 
overlap ratio of h/cr = 1, the results of reinitializing every 10 
steps are the same as reinitializing every 20 steps. 

One of the advantages of Lagrangian vortex methods is said 
to be the use of larger time increments, A^. Our studies show 
that high-order turbulence statistics can only be accurately cal- 
culated when is suflficiently small to resolve the small scales 
of turbulence. For this reason, we are only able to double 
At from the stability limit of the spectral method, making the 
advantage quite moderate. This is with first-order Euler inte- 
gration, however, so the use of higher-order time integration 
schemes could allow even larger time step sizes. 

The comparison for difl'erent Reynolds numbers reveals the 
efl'ect of spatial resolution in vortex methods. We are able to 
provide, for the first time, quantitative results indicating the 
number of vortex particles that are needed to reproduce high- 
order turbulence statistics for a given Reynolds number. Our 
conclusion is that at least N = 256^ is necessary to obtain ac- 
curate velocity derivate skewness at Re;i = 50. This conclu- 
sion applies to the vortex method with Gaussian bases, which 
off'er second-order convergence with respect to the core size. 
Higher-order basis functions that should require less resolution 
are available, but they are very sensitive to particle overlap so 
the trade-off' would need to be studied. 

These observations emphasize the importance of the choice 
of parameters when performing a vortex method simulation 
for turbulence. The relative efficiency of vortex methods de- 
pends heavily on each of these parameters, since adjusting them 
makes a large diff'erence in the calculation runtime. Although 
a quantitative assessment of the relative performance between 
vortex methods and spectral methods is beyond the scope of 
this paper, we are able to provide the necessary conditions for 
achieving the required accuracy. This information can be used 
to optimize the parameters in performance studies of vortex 
methods. 

Our current results indicate that the spectral method is an or- 
der of magnitude faster than the vortex method when using a 
single GPU for the FMM and six CPU cores for the FFT. Our 
most recent results, to be published in a separate paper [27], 
show that as the number of GPUs/CPUs increases, the scalabil- 
ity of FMM compared to FFT allows vortex methods to achieve 
higher parallel efficiency. The wall-clock time for solving with 

= 4096^ particles using the vortex method on 4096 GPUs is 
comparable to that of spectral methods using the same number 
of points. 

With these studies and our policy of open-source code, we 
are able to provide a Lagrangian vortex method for the di- 
rect numerical simulation of turbulence that is validated and 
well understood, and results that are reproducible. The entire 
code that was used to obtain the present results is available 
from https : / /bitbucket . org/ exaf mm/ exaf mm. The revi- 
sion number used for the present runs is 146. Documentation 
and links to other publications are found in the project home- 



page at http : //exaf mm . org/. The fact that we compare with 
a highly reliable reference — a pseudo- spectral method — in one 
of the most commonly used benchmarks of turbulence provides 
a concrete starting point for further investigations regarding per- 
formance and scalability of the numerical engines. 
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